Spooky Author Identification Dataset

In this article, we work on a natural language processing (NLP) project using a dataset from Kaggle.com.

The dataset contains text from works of fiction written by spooky authors of the public domain: Edgar Allan Poe, HP Lovecraft and Mary Shelley. The data was prepared by chunking larger texts into sentences using CoreNLP's MaxEnt sentence tokenizer, so you may notice the odd non-sentence here and there. Your objective is to accurately identify the author of the sentences in the test set.

File descriptions

Data fields

Problem Description

We would like to develop a model that predicts the author of a given text.

Training and testing sets

First, let's define the feature set $X$ (the text) and the target set $y$ (the author). We can use sklearn.preprocessing.LabelEncoder to encode the author labels in Data as integers.
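As a minimal sketch of this encoding step, the snippet below uses a few made-up sentences and author codes in place of the real Kaggle data. The codes `EAP`, `HPL` and `MWS`, and the names `texts` and `authors`, are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the training data (made-up sentences, not quotes).
texts = np.array([
    "The raven watched me from the chamber door.",
    "An ancient horror stirred beneath the sea.",
    "My creature fled across the frozen waste.",
    "A pendulum swung above the prisoner.",
    "The cult whispered a nameless word.",
    "I wandered the glacier in despair.",
])
authors = np.array(["EAP", "HPL", "MWS", "EAP", "HPL", "MWS"])

# X is the raw text; y is the author encoded as an integer.
encoder = LabelEncoder()
X = texts
y = encoder.fit_transform(authors)

print(encoder.classes_)  # classes are stored in sorted order
print(y)
```

The fitted encoder keeps the mapping, so `encoder.inverse_transform` can later turn predicted integers back into author labels.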

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
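The stratification can be checked on a small example. The sketch below assumes a hypothetical, deliberately imbalanced label vector (six samples of class 0, three each of classes 1 and 2) and counts the classes in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical encoded labels with a 2:1:1 class ratio.
y = np.array([0] * 6 + [1] * 3 + [2] * 3)
# Placeholder features; only the labels drive the stratification.
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

fold_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Count how many samples of each class landed in this test fold.
    counts = np.bincount(y[test_idx], minlength=3).tolist()
    fold_counts.append(counts)
    print(f"fold {fold}: test class counts = {counts}")
```

Each test fold keeps the 2:1:1 ratio of the full label vector, which is what a plain k-fold split does not guarantee.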

TF-IDF features

Next, we use sklearn.feature_extraction.text.TfidfVectorizer to convert the text data to a matrix of TF-IDF features.
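A minimal sketch of this step, assuming made-up sentences and the default vectorizer settings (the settings used in the original analysis are not stated here). The vectorizer is fitted on the training texts only and then reused on the test texts, so both matrices share the same columns.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts for illustration only.
train_texts = [
    "the raven watched the chamber door",
    "an ancient horror stirred beneath the sea",
    "the creature fled across the frozen waste",
]
test_texts = ["a horror watched the frozen sea"]

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(train_texts)  # learn vocabulary and IDF weights
X_test_tfidf = tfidf.transform(test_texts)        # reuse them on unseen text

print(X_train_tfidf.shape, X_test_tfidf.shape)

# With the default norm="l2", every non-empty row has unit length.
row_norms = np.linalg.norm(X_train_tfidf.toarray(), axis=1)
print(row_norms)
```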

Count Vectorizer

An alternative approach is to use sklearn.feature_extraction.text.CountVectorizer to convert the text data to a matrix of token counts.
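To illustrate what a matrix of token counts looks like, here is a small sketch on two made-up documents with the default settings. The `docs` list is an assumption for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents for illustration only.
docs = [
    "the raven and the night",
    "the night was cold",
]

cv = CountVectorizer()
X_counts = cv.fit_transform(docs)

# Columns follow the vocabulary in alphabetical order.
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(vocab)
print(X_counts.toarray())
```

Each row is a document and each column a token, so the entry for "the" in the first document is 2. Unlike TF-IDF, the counts are not reweighted by how common a token is across documents.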

Modeling: Logistic Regression

We can use sklearn.linear_model.LogisticRegression as the classifier, first with the TF-IDF features and then with the token counts from the count vectorizer.

Logistic Regression using TF-IDF features
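One way to sketch this combination is a scikit-learn pipeline that chains the vectorizer and the classifier. The texts, the labels, and the `max_iter=1000` setting below are assumptions for illustration, not the settings of the original analysis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: made-up sentences with encoded authors 0, 1, 2.
texts = [
    "the raven watched the chamber door",
    "a pendulum swung above the prisoner",
    "an ancient horror stirred beneath the sea",
    "the cult whispered a nameless word",
    "the creature fled across the frozen waste",
    "i wandered the glacier in despair",
]
labels = [0, 0, 1, 1, 2, 2]

# The pipeline fits the vectorizer and the classifier in one call.
tfidf_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
tfidf_model.fit(texts, labels)

# One probability per author for each input sentence.
proba = tfidf_model.predict_proba(["a nameless horror beneath the glacier"])
print(proba)
```

Wrapping both steps in a pipeline keeps the vectorizer from ever seeing validation or test text during fitting.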

Logistic Regression using Count Vectorizer
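The same classifier can be fitted on token counts and scored with stratified folds. The sketch below assumes made-up data and accuracy as the metric; with so few samples the scores only show the mechanics and say nothing about real performance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical training data: made-up sentences with encoded authors 0, 1, 2.
texts = [
    "the raven watched the chamber door",
    "a pendulum swung above the prisoner",
    "an ancient horror stirred beneath the sea",
    "the cult whispered a nameless word",
    "the creature fled across the frozen waste",
    "i wandered the glacier in despair",
]
labels = [0, 0, 1, 1, 2, 2]

count_model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# Two stratified folds: each test fold holds one sentence per author.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(count_model, texts, labels, cv=skf, scoring="accuracy")
print(scores)
```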

Predictions

Using the model, we can now predict the author of each text in the Pred (test) dataset.
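A minimal end-to-end sketch of the prediction step, under the same assumptions as before: made-up sentences, the author codes `EAP`, `HPL` and `MWS`, and a TF-IDF plus logistic regression pipeline. The variable names are hypothetical and do not come from the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the training and test sets.
train_texts = [
    "the raven watched the chamber door",
    "a pendulum swung above the prisoner",
    "an ancient horror stirred beneath the sea",
    "the cult whispered a nameless word",
    "the creature fled across the frozen waste",
    "i wandered the glacier in despair",
]
train_authors = ["EAP", "EAP", "HPL", "HPL", "MWS", "MWS"]
test_texts = ["a raven above the door", "the horror beneath the glacier"]

# Encode the authors, fit the model, then predict on the unseen texts.
encoder = LabelEncoder()
y_train = encoder.fit_transform(train_authors)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, y_train)

pred_codes = model.predict(test_texts)                 # integer class per text
pred_authors = encoder.inverse_transform(pred_codes)   # back to author labels
pred_proba = model.predict_proba(test_texts)           # one column per author

for text, author in zip(test_texts, pred_authors):
    print(f"{author}: {text}")
```

The columns of `pred_proba` follow the order of `encoder.classes_`, which is useful when per-author probabilities are needed rather than a single predicted label.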